ABSTRACT 

Summary: Finding significant differences between the expression 
levels of genes or proteins across diverse biological conditions is 
one of the primary goals in the analysis of functional genomics data. 
However, existing methods for identifying differentially expressed 
genes or sets of genes by comparing measures of the average 
expression across predefined sample groups do not detect 
differential variance in the expression levels across genes in 
cellular pathways. Since corresponding pathway deregulations occur 
frequently in microarray gene or protein expression data, we present 
a new dedicated web application, PathVar, to analyze these data 
sources. The software ranks pathway-representing gene/protein sets 
in terms of the differences of the variance in the within-pathway 
expression levels across different biological conditions. Apart from 
identifying new pathway deregulation patterns, the tool exploits these 
patterns by combining different machine learning methods to find 
clusters of similar samples and build sample classification models. 
Availability: freely available at http://pathvar.embl.de 
Contact: enrico.glaab@uni.lu 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 
1 INTRODUCTION 

In the search for new diagnostic biomarkers, one of the first 
steps is often the identification of significant differences in the 
expression levels of genes or proteins across different biological 
conditions. Commonly used statistical methods for this purpose 
quantify the extent and significance of changes in measures of 
the average expression levels of single genes/proteins [see for 
example Smyth (2004); Tusher et al.] or analyze aggregated 
data for gene/protein sets representing entire cellular pathways 
and processes (Glaab et al., 2010; Guo et al., 2005; Lee et al., 
2008). However, since these approaches compare measures of 
averaged expression levels, they cannot study how the variance of 
expression levels across the genes/proteins of a cellular pathway 
(termed 'pathway expression variance' here) changes under different 
biological conditions. In this article, we present a web application 
for microarray data analysis to identify and prioritize pathways 
with changes in the pathway expression variance across samples 
(unsupervised setting) or predefined sample groups (supervised 
setting). In particular, we show example cases on cancer data in 
which significant pathway deregulations manifest themselves in 
terms of changes in the variance of gene/protein expression levels 
in pathways, while no significant changes can be detected in the 
median pathway expression levels (see section 'Results on Cancer 
Microarray Data' and Fig. 1). Finally, we discuss how the software 
enables automated sample clustering and classification using the 
extracted pathway expression variances. 

Fig. 1. Left: box plot comparing the median expression levels in the KEGG 
Urea cycle pathway (hsa00220) for the prostate cancer dataset by Singh 
et al. (2002) across 50 healthy control samples and 52 tumor samples 
(red); right: box plot comparing the variance of expression levels in the 
same pathway and microarray dataset (see also Supplementary Material). 

2 WORKFLOW AND METHODS 

PathVar identifies and analyzes deregulation patterns in pathway expression 
using two possible analysis modes, a supervised and an unsupervised mode, 
chosen automatically depending on the availability of sample class labels. 
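
As a minimal sketch of how a pathway expression variance matrix might be derived, the following Python snippet computes, for each pathway and each sample, the variance of the expression values across the genes mapped to that pathway. The function name, the dictionary-based input layout and the toy identifiers are illustrative assumptions; they do not reproduce PathVar's actual R implementation.

```python
from statistics import variance


def pathway_variance_matrix(expression, pathways, min_genes=10):
    """Return {pathway_id: [variance across member genes, one value per sample]}.

    expression: {gene_id: [expression value for each sample]}
    pathways:   {pathway_id: iterable of gene_ids}
    min_genes:  pathways with fewer mapped genes are skipped
                (10 mirrors the threshold stated in the article).
    """
    n_samples = len(next(iter(expression.values())))
    matrix = {}
    for pathway_id, genes in pathways.items():
        # Keep only the genes that are actually present in the dataset.
        mapped = [g for g in genes if g in expression]
        if len(mapped) < min_genes:
            continue
        # One variance per sample, taken across the mapped genes.
        matrix[pathway_id] = [
            variance([expression[g][s] for g in mapped])
            for s in range(n_samples)
        ]
    return matrix


# Toy data (hypothetical identifiers): three genes measured in two samples.
toy_expression = {"g1": [1.0, 2.0], "g2": [1.0, 4.0], "g3": [1.0, 6.0]}
toy_pathways = {"toy_pathway": ["g1", "g2", "g3", "unmapped_gene"]}
toy_matrix = pathway_variance_matrix(toy_expression, toy_pathways, min_genes=2)
print(toy_matrix)
```

In this toy example the first sample has identical values for all three genes (variance 0), whereas the second sample has spread-out values, which is the kind of difference a variance-based analysis is meant to pick up.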

In the first step, the user uploads a pre-normalized, tab-delimited 
microarray dataset and chooses an annotation database to map genes/proteins 
onto cellular pathways and processes (see Section 4). Next, in the supervised 
analysis mode, the software computes two gene/protein set rankings in 
terms of differential pathway expression variance using a parametric 
T-test and a non-parametric Mann-Whitney U-test (or, respectively, an 
F-test and Kruskal-Wallis test for multi-class data). Alternatively, in the 
unsupervised analysis mode, three feature rankings are obtained from the 
pathway expression variance matrix (rows = pathways, columns = samples) 
by computing the absolute variances across the columns/samples, the 
magnitude of the loadings in a sparse principal component analysis 
(Zou and Hastie, 2008) and a recently proposed entropy score (Varshavsky 
et al., 2006). These rankings are combined by computing the sum of 
ranks across the three methods and normalizing the sum-of-ranks scores by 
dividing by the maximum possible score. The resulting sortable ranking table 
of pathways contains the test statistics and significance scores, the number 
and identifiers of the mapped genes/proteins, and buttons to generate box 
plots for each pathway and forward the genes/proteins to other bioscientific 
web services for further analysis. Moreover, a heat-map visualization of the 
expression level variances is provided as output. 
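
The sum-of-ranks combination used in the unsupervised mode can be sketched as follows. The snippet assumes that a higher raw score means a more informative pathway (rank 1 is best), that ties receive the average rank, and that the 'maximum possible score' equals the number of methods times the number of pathways; these conventions, as well as the function names and toy scores, are assumptions made for illustration and are not specified in the text.

```python
def ranks_descending(scores):
    """Rank 1 = highest score; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score is tied with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks


def combined_rank_scores(score_lists):
    """Sum the per-method ranks and divide by the largest possible rank sum."""
    n_items = len(score_lists[0])
    per_method = [ranks_descending(scores) for scores in score_lists]
    max_sum = len(score_lists) * n_items  # worst rank in every method
    return [sum(r[i] for r in per_method) / max_sum for i in range(n_items)]


# Hypothetical scores for three pathways from three ranking methods.
variance_scores = [5.0, 1.0, 3.0]
loading_scores = [0.9, 0.1, 0.5]
entropy_scores = [0.2, 0.1, 0.7]
combined = combined_rank_scores([variance_scores, loading_scores, entropy_scores])
print(combined)
```

Under these assumptions a lower combined score indicates a pathway that is ranked highly by most methods, and a score of 1 marks a pathway ranked last by all of them.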

In the next step, the user can forward the extracted pathway variance 
data to a clustering module, for identifying sample groups with similar 
expression variance across multiple pathways, or to a classification module 
(for labelled data), to build models for sample classification. The clustering 
module provides a selection of four hierarchical clustering algorithms, 
three partition-based approaches and one consensus clustering approach 
to combine the results of the individual methods [see Glaab et al. (2009) 
and Supplementary Material]. In order to compare the outcome for different 
clustering approaches and identify a number of clusters that is optimal 
in terms of cluster compactness and separation between the clusters, five 
validity indices are computed and aggregated by computing the sum of 
validity score ranks across all methods and numbers of clusters. Moreover, 
the clustering results are visualized using both 2D plots (cluster validity 
score plots, principal component plots, dendrograms and silhouette plots) 
and interactive 3D visualizations using dimensionality reduction methods 
(Supplementary Material). 
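
To illustrate what a validity index measuring cluster compactness and separation computes, the sketch below implements the average silhouette width in plain Python. The text does not name the five indices used by PathVar, so the choice of the silhouette width here is an assumption (motivated only by the silhouette plots mentioned above), and the function name and toy coordinates are hypothetical.

```python
from math import dist


def mean_silhouette(points, labels):
    """Average silhouette width; values near 1 indicate compact,
    well-separated clusters, negative values indicate poor assignments."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("the silhouette width needs at least two clusters")

    widths = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            widths.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        widths.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(widths) / len(widths)


# Two obvious groups of samples (hypothetical 2D pathway-variance profiles).
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_partition = [0, 0, 1, 1]
poor_partition = [0, 1, 0, 1]
print(mean_silhouette(toy_points, good_partition))
print(mean_silhouette(toy_points, poor_partition))
```

Computing such an index for every clustering method and every candidate number of clusters yields the scores that can then be ranked and summed to select a partition.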

For a supervised analysis of the data, the classification module contains six 
diverse feature selection methods and six prediction algorithms, which can 
be combined freely by the user [see Glaab et al. (2009) and Supplementary 
Material]. To estimate the accuracy of the generated classification models, 
the available evaluation schemes include an external n-fold cross-validation 
as well as user-defined training/test set partitions. In addition to the average 
prediction accuracy and SD obtained from these evaluation methods, several 
other performance statistics like the sensitivity and specificity, and Cohen's 
Kappa statistic are computed. Additionally, a Z-score estimate of each gene 
set's utility for sample classification is determined from the frequency of its 
selection across different cross-validation cycles, and a heat map is generated 
to visualize the expression variance for the most informative gene sets. All 
machine learning technique implementations stem from a fully automated 
data analysis framework (Glaab et al., 2009), which has previously been 
employed in a variety of bioscientific studies (Bassel et al., 2011; Glaab et al., 
2010; Habashy et al., 2011). 
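
The performance statistics mentioned above follow standard definitions, which the sketch below spells out for a two-class problem. The function name, the returned dictionary layout and the toy labels are illustrative assumptions; the snippet also assumes that both classes occur among the true labels and that chance agreement is below 1, since the ratios are otherwise undefined.

```python
def binary_performance(true_labels, predicted_labels, positive):
    """Accuracy, sensitivity, specificity and Cohen's kappa for two classes."""
    tp = fp = tn = fn = 0
    for truth, prediction in zip(true_labels, predicted_labels):
        if truth == positive and prediction == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif prediction == positive:
            fp += 1
        else:
            tn += 1
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the marginal label frequencies.
    expected = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),  # fraction of positives recovered
        "specificity": tn / (tn + fp),  # fraction of negatives recovered
        "kappa": (accuracy - expected) / (1 - expected),
    }


# Hypothetical predictions for four tumor and four normal samples.
truth = ["tumor"] * 4 + ["normal"] * 4
predicted = ["tumor", "tumor", "tumor", "normal",
             "normal", "normal", "normal", "tumor"]
stats = binary_performance(truth, predicted, positive="tumor")
print(stats)
```

Unlike the raw accuracy, Cohen's kappa corrects for the agreement expected by chance, which makes it more informative when the sample classes are unbalanced.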

To alleviate statistical limitations resulting from incomplete mappings of 
genes/proteins onto pathways and from multiple hypothesis testing, only 
pathways with a minimum of 10 mapped identifiers are considered in all 
analyses and P-values are adjusted according to Benjamini and Hochberg 
(1995) (see section on limitations in the Supplementary Material for details 
and advice). 
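
For readers unfamiliar with the Benjamini and Hochberg procedure, the following sketch shows the standard step-up adjustment of a list of P-values. The function name and the example P-values are hypothetical, and the code is a generic illustration rather than the routine used inside PathVar.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P-value so that the adjusted
    # values stay monotone in the original ordering.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw P-values for four pathways.
raw_p = [0.01, 0.04, 0.03, 0.5]
adjusted_p = benjamini_hochberg(raw_p)
print(adjusted_p)
```

Each raw P-value is scaled by the number of tests divided by its rank, and the running minimum ensures that a pathway with a smaller raw P-value never receives a larger adjusted value than one ranked below it.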



3 RESULTS ON CANCER MICROARRAY DATA 

The microarray prostate cancer dataset by Singh et al. (2002), containing 52 
tumor samples and 50 healthy control samples, is a typical example for a 
cancer-related high-throughput dataset with gene expression deregulations 
across many cellular pathways. When analyzing this data using both a 
comparison of median gene expression levels in KEGG pathways across 
the sample classes, and a comparison of the expression level variances 
with PathVar, the top-ranked pathway in terms of differential expression 
variance, Urea cycle and metabolism of amino groups (hsa00220), showed 
a significant increase of the variance in the tumor samples (see Fig. 1, 
right; adjusted P-value: 2.2e-06). Interestingly, a conventional comparison 
of the corresponding median gene expression levels does not identify 
statistically significant differences between the sample groups (Fig. 1, left). 
Similar results were obtained for other cancer-associated KEGG pathways, 
including the angiogenesis-related VEGF signaling pathway (hsa04370) 
and the inflammation-related Natural killer cell mediated cytotoxicity 
(hsa04650) process. Corresponding statistics and box plots are provided in 
the Supplementary Material, which also contains results from the clustering 
module and the classification module, similar outputs for a further microarray 
study, as well as details on the used data and normalization procedures. In 
summary, PathVar identifies statistically significant pathway deregulations, 
different from those detected by methods for comparing averaged expression 
levels, and provides pathway-based clustering and classification models that 
enable a new interpretation of microarray data. 

4 IMPLEMENTATION 

All data analysis procedures were implemented in the R statistical 
programming language and made accessible via a web interface written 
in PHP on an Apache web server. Gene and protein sets representing 
cellular pathways and processes were retrieved from the databases KEGG 
(Kanehisa et al., 2008), BioCarta (Nishimura, 2001), Reactome (Joshi-Tope 
et al., 2005), NCI Pathway Interaction Database (Schaefer et al., 2009), 
WikiPathways (Pico et al., 2008), InterPro (Apweiler et al., 2001) and Gene 
Ontology [GOSlim; Ashburner et al. (2000)] and will be updated on a regular 
basis. A detailed tutorial for the software is provided on the web page. 